DESCRIPTION

LONG-INTEGER MULTIPLIER 

The present invention relates to methods and apparatus for the
multiplication of two long integers and the addition of a third long integer,
modulo a fourth long integer. Such multiplications must be carried out
repeatedly during implementation of, for example, cryptographic algorithms in
cryptographic processors such as those used in smart cards.

The increasing use of cryptographic algorithms in electronic devices has
established a need to quickly and efficiently execute long integer modular
multiplications. For example, smart cards and many other electronic devices
use a number of cryptographic protocols such as RSA, and others based
on elliptic curve and hyperelliptic curve calculations. All of these protocols
have, as a basic requirement, the ability to perform long integer modular
multiplications of the form R = X.Y + Z mod N, although the addition of Z is
not always required.

Typically, with protocols such as RSA, the long integers X and Y are 
1024-bit, or even 2048-bit integers, and the multiplication operations must be 
carried out many hundreds or thousands of times to complete an encryption or 
decryption operation. It is therefore desirable that the cryptographic devices
that perform these operations execute the long integer multiplications quickly.

An aspect of carrying out such long integer multiplications is to break
down the long integers into a number of words and to successively multiply the
words together in an iterative process which produces a succession of
intermediate results which are accumulated to obtain the final result. A feature
of this technique is the necessity for summing a large number of addends of
various lengths during each stage of the multiplication process. Therefore, the
number of addends for any given bit position can vary significantly.
Conventionally, such summation operations can be implemented using
Wallace trees, but these often make use of rather more hardware, and
introduce rather more delay, than is desirable.

It is an object of the present invention to provide a method and 
apparatus for effecting long integer multiplication operations as quickly as 
possible. 

It is an object of the invention to provide a more efficient method and 
apparatus for the summation of a large number of addends, particularly where 
the number of addend bits varies as a function of the bit position in the sum. 

In one arrangement, an adder circuit for multiplying two long integers 
deploys a network of adders for summing a succession of words of the long 
integers to generate intermediate results. The number of addends varies as a 
function of bit position and the network of adders is designed to reduce the 
number of levels of adders in the network according to a maximum number of 
expected addends. An object is to adapt the network to include a number of 
adders that varies as a function of bit position. 

In another arrangement, an output stage may be provided that adds
sum and carry outputs of the network representing an intermediate result. An
objective is to avoid delay in passing a carry bit from this output stage back to
the network, by retaining a most significant (carry) bit for use with a
subsequent calculation output of the network.

In another arrangement, an objective is to enable the network to 
commence a subsequent calculation with a new set of addends prior to 
completion of the previous calculation. The network of adders may be 
configured so that the output of the previous calculation is fed back to the
network at an intermediate level between its highest (input) level and its lowest
(output) level.

According to one aspect, the present invention provides an adder circuit 
for summing a plurality of addends from multi-bit words comprising: 

a network of n-input carry-save adder circuits each having a first
number of sum outputs and a second number of carry outputs,
the adder circuits being arranged in a plurality of columns, each column
corresponding to a predetermined bit position in the sum, and being arranged
in a plurality of levels,

the first level receiving a number of addends from corresponding bit
positions of selected ones of the plurality of words and

the lower levels each receiving addends from one or more of (i)
corresponding bit positions of other selected ones of the plurality of words, (ii)
sum outputs from a higher level adder circuit in the same column, and (iii)
carry outputs from a higher level adder circuit in a column corresponding to a
less significant bit position,

wherein the number of n-input adders in each column varies according
to the bit position.

According to another aspect, the present invention provides an adder 
circuit comprising: 

an input for receiving a plurality of addends; 

first summation means for summing a plurality of addends to produce 
an output comprising a high order part and a first and second low order part; 

a first feedback line for coupling the high order part to a lower order
position at said input, for a subsequent calculation; and

an output stage including second summation means for summing the 
first and second low order parts to provide a first word output and a feedback 
register for retaining a carry bit from said second summation means and for 
providing said carry bit as input to said second summation means during a 
subsequent calculation. 

According to another aspect, the present invention provides a pipelined 
adder circuit for summing a plurality of addends from multi-bit words 
comprising: 

first summation means comprising a network of carry-save adder
circuits, the adder circuits being arranged in a plurality of columns, each
column corresponding to a predetermined bit position in the sum, and being
arranged in a plurality of levels, the first level coupled for receiving a number of
addends from corresponding bit positions of selected ones of the plurality of
words and the lower levels coupled for receiving addends from one or more of
(i) corresponding bit positions of other selected ones of the plurality of words,
(ii) sum outputs from a higher level adder circuit in the same column, and (iii)
carry outputs from a higher level adder circuit in a column corresponding to a
less significant bit position,

a first feedback line for coupling a first plurality of more significant bit
outputs of the lowest level adder circuits to a corresponding number of less
significant bit inputs of an intermediate level of adder circuits for a subsequent
calculation, the intermediate level being between said first and lowest level
adder circuits.

Embodiments of the present invention will now be described by way of 
example and with reference to the accompanying drawings in which: 

Figure 1 shows an array multiplier suitable for carrying out the
multiplication operations B.c + r = x.y + c + z, where x and c have a width of
64 bits, while y, z and r have a width of 16 bits;

Figure 2 shows a bit alignment of words to be added in a pipelined
multiplier performing the calculation

$$R_j = x_{n-j-1}y_0 + z_{n-j-1} + (x_{n-j-1}y_1 + r_{j-1,0})B_y + (x_{n-j-1}y_2 + r_{j-1,1})B_y^2 + \dots + (x_{n-j-1}y_{n-1} + r_{j-1,n-2})B_y^{n-1} + r_{j-1,n-1}B_y^n,$$

where each of the x.y word products is split into a number of products denoted
by Pj, e.g. P0...P15, shown together with a sum term denoted by Z;

Figure 3 is a graph showing the number of addends, per bit position, for 
the summation of words of figure 2; 

Figure 4 shows a fragment of a conventional Wallace tree structure 
suitable for implementing the pipelined summation of words of figure 2; 

Figure 5 shows a fragment of an adaptive tree structure suitable for 
implementing the pipelined summation of words of figure 2; 

Figure 6 shows a schematic block diagram of an unpipelined adder
suitable for implementing the summation of words of figure 2;

Figure 7 shows a schematic block diagram of a pipelined adder based
on the structure of the adder of figure 6;

Figure 8 shows a further fragment of the adaptive tree structure of figure 
5, suitable for implementing the pipelined summation of words of figure 2; 

Figure 9 shows a portion of an adaptive tree structure according to 
figure 5; and 

Figure 10 shows the insertion of a number of two-input carry-save
adders into the adaptive tree structure of figure 9.

To calculate the product X.Y + Z mod N where X, Y and Z are
long-integer variables, eg. of the order of 1024 or 2048 bits in length, the
long-integer variables X, Y and Z are split into smaller "words" of, for example,
32 or 64 bits in length.

First, X and Z are split up into n words, generally each of length k, such
that:

$$X = x_{n-1}B_x^{n-1} + x_{n-2}B_x^{n-2} + \dots + x_0, \text{ and}$$

$$Z = z_{n-1}B_x^{n-1} + z_{n-2}B_x^{n-2} + \dots + z_0,$$

where $B_x = 2^k$. In one example, k = 32, and in another example k = 64.
In this manner, X and Z are fragmented into a plurality of words each of
length k bits.

Then, the result R can be calculated as follows:

$$R = (((((x_{n-1}Y + z_{n-1}) \bmod N)\,B_x + x_{n-2}Y + z_{n-2}) \bmod N)\,B_x + \dots + x_0Y + z_0) \bmod N$$

Thus, $R_j = (x_{n-j-1}Y + z_{n-j-1} + R_{j-1}B_x) \bmod N$.

First, we multiply $x_{n-1}$ by the complete Y and add $z_{n-1}$; then we
calculate the modulo N reduction. The result is $R_0$.

Next, we multiply $x_{n-2}$ by the complete Y, add $z_{n-2}$ and $R_0 \cdot B_x$
to the result and calculate the modulo N reduction. The result is $R_1$.

Next, we multiply $x_{n-3}$ by the complete Y, add $z_{n-3}$ and $R_1 \cdot B_x$
to the result and calculate the modulo N reduction. The result is $R_2$.

This procedure is repeated until we have used all words of X, $x_0$ being
the last word of X to be processed, to obtain the final result $R = R_{n-1}$.
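
The word-serial procedure just described can be modelled in a few lines of
Python. This is only an illustrative sketch, not the claimed hardware: the
function name `modmul_word_serial` and its parameters are assumptions
introduced here, Python's arbitrary-precision integers stand in for the
multiplier, and the `%` operator stands in for whichever modulo N reduction is
used.

```python
def modmul_word_serial(X, Y, Z, N, n, k=32):
    """Return (X*Y + Z) mod N, consuming X and Z one k-bit word at a time.

    Assumes X and Z fit in n words of k bits each.
    """
    Bx = 1 << k
    mask = Bx - 1
    R = 0
    # Process the words from the most significant (x_{n-1}) down to x_0.
    for i in range(n - 1, -1, -1):
        x_i = (X >> (k * i)) & mask
        z_i = (Z >> (k * i)) & mask
        # R_j = (x_{n-j-1}.Y + z_{n-j-1} + R_{j-1}.Bx) mod N
        R = (x_i * Y + z_i + R * Bx) % N
    return R
```

Each loop iteration corresponds to one intermediate result Rj; the value after
the last iteration is the final result R.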

However, a multiplier for Y being 1024 bits long is undesirable from a
practical viewpoint. Therefore, we also break down Y, and thus Rj, into smaller
"words" of, for example, 32 bits or 16 bits in length.

Therefore, the basic multiplication
$R_j = (x_{n-j-1}Y + z_{n-j-1} + R_{j-1}B_x) \bmod N$ is also fragmented.

We split Y and Rj into p words of m bits in length, ie. $B_y = 2^m$:

$$Y = y_{p-1}B_y^{p-1} + y_{p-2}B_y^{p-2} + \dots + y_0$$

$$R_j = r_{j,p-1}B_y^{p-1} + r_{j,p-2}B_y^{p-2} + \dots + r_{j,0}$$

For simplicity, we first assume that the lengths of X and Y are the same,
and that the sizes of the X and Y words are the same, so that p = n and m = k.
Later, we will show what has to be changed when this is not the case.

In this manner, X and Y are fragmented into n words each of length k
bits. Then,

$$R_j = x_{n-j-1}y_0 + z_{n-j-1} + (x_{n-j-1}y_1 + r_{j-1,0})B_y + (x_{n-j-1}y_2 + r_{j-1,1})B_y^2 + \dots + (x_{n-j-1}y_{n-1} + r_{j-1,n-2})B_y^{n-1} + r_{j-1,n-1}B_y^n$$

For the calculation of $R_j$, we perform the following operations:

First, we multiply $x_{n-j-1}$ by $y_0$, add $r_{j-1,-1} = z_{n-j-1}$ and split
the result into two equal parts: the lower part $r_{j,0}$ (m bits) and the higher
part $c_{j,0}$ (k bits):

$$B \cdot c_{j,0} + r_{j,0} = x_{n-j-1} \cdot y_0 + r_{j-1,-1}.$$

$r_{j,0}$ is saved as part of the outcome.

Next, we multiply $x_{n-j-1}$ by $y_1$ and add the previous carry word $c_{j,0}$.
Moreover, we add $z_0 = r_{j-1,0}$ too. The result is again split into two equal
parts: the lower part $r_{j,1}$ and the higher part $c_{j,1}$:

$$B \cdot c_{j,1} + r_{j,1} = x_{n-j-1} \cdot y_1 + c_{j,0} + r_{j-1,0}.$$

$r_{j,1}$ is saved as part of the outcome.

Next, we multiply $x_{n-j-1}$ by $y_2$ and add the previous carry word $c_{j,1}$.
Moreover, we add $z_1 = r_{j-1,1}$ too. The result is again split into two equal
parts: the lower part $r_{j,2}$ and the higher part $c_{j,2}$:

$$B \cdot c_{j,2} + r_{j,2} = x_{n-j-1} \cdot y_2 + c_{j,1} + r_{j-1,1}.$$

$r_{j,2}$ is saved as part of the outcome.

This procedure is repeated until we perform the last multiplication, by
$y_{n-1}$, ie. we multiply $x_{n-j-1}$ by $y_{n-1}$ and add the previous carry word
$c_{j,n-2}$. Moreover, we add $z_{n-2} = r_{j-1,n-2}$ too. The result is again split
into two parts, respectively of k and m bits in length: the lower part $r_{j,n-1}$
and the higher part $c_{j,n-1}$:

$$B_y \cdot c_{j,n-1} + r_{j,n-1} = x_{n-j-1} \cdot y_{n-1} + c_{j,n-2} + r_{j-1,n-2}.$$

$r_{j,n-1}$ is saved as part of the outcome.

The last step is the addition of $c_{j,n-1}$ and $z_{n-1}$:

$$r_{j,n} = c_{j,n-1} + r_{j-1,n-1},$$

and $r_{j,n}$ is saved as part of the outcome.

Now Rj is complete and is larger than the Y variable from which it was
derived by the length of one word of X. The size of Rj is preferably reduced by
one word in a modulo N reduction, and the reduced result is then used as Rj
during the calculation of the subsequent $R_{j+1}$.

The above calculation described the general procedure where the
length of the X words (k) is the same as the length of the Y words (m), ie.
$B_x = B_y$.

The X words may be different in length from the Y words. For example,
if k/m > 1, eg. k = 64 and m = 16, so that $B_x = B_y^{k/m} = B_y^4$, then:

1. The addition of z is done during the first k/m (= 4, in the example)
multiplications and the addition of Rj starts thereafter.

2. The carry word $c_j$ is k/m (= 4) times larger (4m bits in length) than the
result $r_{j,i}$ (m bits in length).

3. The last step consists of the addition of the carry word and the
remaining part of Rj, which are both 4m bits wide. This addition might be done
by the same multiplier by choosing y = 0 in k/m steps, where in each step
words of m bits are added.

Thus, in the basic operation, omitting all indices:

B.c + r = x.y + c + z

During the first operation, c = 0. z consists of k/m words of Z followed by
all words of r. During the last k/m operations, y = 0. x is kept constant for the
complete set of operations for each $R_j$.
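
The basic operation and its scheduling can be illustrated with a short Python
model. It is a sketch under stated assumptions rather than a description of the
circuit: the function name `mac_row` and its argument names are hypothetical,
the previous result is passed as a single integer `R_prev` already reduced to
p words of m bits, and x and z are assumed to be one k-bit word each.

```python
def mac_row(x, Y, z, R_prev, p, k=64, m=16):
    """Return x*Y + z + R_prev * 2**k using only k-by-m-bit products.

    Y and R_prev are assumed to fit in p words of m bits; x and z in k bits.
    """
    assert k % m == 0
    ratio = k // m                       # k/m, eg. 4 for k = 64 and m = 16
    mask = (1 << m) - 1
    steps = p + ratio                    # during the last k/m steps, y = 0
    addend = z + (R_prev << k)           # k/m words of z, then the words of R_prev
    c = 0                                # carry word, at most k bits wide
    result = 0
    for i in range(steps):
        y_i = (Y >> (m * i)) & mask if i < p else 0
        z_i = (addend >> (m * i)) & mask
        t = x * y_i + c + z_i            # basic operation: B.c + r = x.y + c + z
        result |= (t & mask) << (m * i)  # r: the low m bits are a finished word
        c = t >> m                       # c: the high part feeds the next step
    return result + (c << (m * steps))   # a final carry, if any, sits on top
```

With k = m the loop reduces to the step-by-step procedure given for equal word
lengths, with a single final step in which y = 0.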

The same multiplier as performs the x.y multiplication can be used for
modulo N reduction. After a complete set of multiplications by a word of X, ie.
x, the result Rj is enlarged by one k-bit word. It must then be reduced by k bits
by modulo N reduction to retrieve the original length prior to computation of the
next Rj.

There are several possible algorithms for modulo reduction (eg.
Quisquater, Barrett, Montgomery, etc), but they all use a multiplication of the
form:

Rj = Xred.N + Rj

where Xred (having a size of k bits) times the modulus N is added to the
result. Alternatively, Xred is subtracted by using the two's complement N'
instead of N. The methods differ in the way that the factor Xred is calculated.
For the Montgomery reduction, the result must also be divided by Bx, ie. the
first word (being all zero) is omitted.

The same basic operation can be used for the reduction:

B.c + r = x.y + c + z

with $B = B_y$, $r = r_{j,i}$, $x = X_{red}$, $y = N_i$ and $z = r_{j,i}$.
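
As an illustration of the Montgomery case, the following Python sketch removes
one k-bit word from a result. The text does not specify how Xred is computed;
the textbook Montgomery choice Xred = -R.N^(-1) mod Bx is assumed here, and the
function name `montgomery_reduce_word` is hypothetical. N must be odd.

```python
def montgomery_reduce_word(R, N, k=64):
    """Return (Xred*N + R) / 2**k, which is congruent to R * 2**(-k) mod N."""
    Bx = 1 << k
    n_inv = pow(N, -1, Bx)       # modular inverse; needs odd N and Python 3.8+
    x_red = (-R * n_inv) % Bx    # k-bit factor chosen so that the low word cancels
    t = x_red * N + R            # Rj = Xred.N + Rj
    assert t % Bx == 0           # the first word is all zero ...
    return t >> k                # ... and is omitted (division by Bx)
```

The returned value is one word shorter than R but may still exceed N, so a
final conditional subtraction may be needed in a complete implementation.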

The above multiplication operations can be carried out in a number of
possible multipliers. However, an array multiplier is a conventional way of
implementing such a multiplier. An example is shown in figure 1.

The exemplary array multiplier 10 is a 64 by 16-bit multiplier, but other
bit configurations can readily be used. The array multiplier 10 calculates each
term in the expression Rj, in the form B.c + r = x.y + c + z. x and c have a
width of 64 bits; y, z and r have a width of 16 bits. c, both as input and output,
in fact consists of two terms, Cc and Cs.

The basic element 12 of the array multiplier is shown inset in figure 1
and includes a multiplier 13 receiving inputs x and y, and an adder 14
receiving the product term x.y and carry and sum inputs Ci and Si to produce
carry and sum outputs Co and So therefrom.

The array multiplier 10 consists of seventeen 'layers' or 'levels', 'add1',
'add2', ... 'add17'. The first sixteen layers add1 ... add16 perform the
multiplication and addition. The last layer, add17 (and the right-most elements
in each layer), performs only additions. The outputs are 16-bit r(15:0) and a
63-bit carry term Cc'(79:16) and a 63-bit sum term Cs'(79:16). The sum of the
carry term Cc' and the sum term Cs' is the carry term c in the calculation:

B.c + r = x.y + c + z.

In fact, this term is never calculated. Instead, the calculation:

B.(c' + s') + r = x.y + c' + s' + z

is performed. The basic element 12 of the array multiplier 10 performs
the bit calculation (Co, So) = y*x + Ci + Si. The adding of z is done by the
rightmost adder of every layer except the first one. The seventeenth layer
consists only of adders, which is necessary for the addition of r(15). A
drawback with the use of this implementation of array multiplier is the low
speed at which it can operate, as a result of cumulative delays from seventeen
layers of logic.

Therefore, it is advantageous to use a pipelined multiplier in which the
processing of the various stages can be overlapped to reduce the computation
time. With reference to figure 2, the various addends required during the
multiplication process are shown schematically. For a 64 by 16-bit multiplier,
the process requires the addition of: (i) 16 product terms P0, P1, ... P15 with
Pj = X(63:0) * Y(j); (ii) a 16-bit Z term Z(15:0); (iii) a 63-bit carry term
Cc(62:0) and (iv) a 63-bit sum term Cs(62:0).

The result Rj(15:0) is output and the intermediate terms Cc'(78:16),
Cs'(78:16) are carried into the calculation of the next term Rj+1.

Figure 3 gives the number of addends per bit position. From bit position
0 through to bit position 15, the number of addends increases linearly from 4
up to 19 as more P terms are included. Then it decreases by 1 for bit 16,
since there are no more z-bits. The number of addends then remains constant
at 18 right through to bit 62 when the carry and sum terms Cc and Cs drop out.
Thus, a reduction in the number of addends by 2 to 16 occurs for bit position
63. Finally, from bit position 63 on up to bit position 78, the number of
addends decreases linearly from 16 down to 1 as each successively higher P
term drops out.
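
The addend profile described for figure 3 can be reproduced by simply counting,
for each bit position, which of the words of figure 2 overlap it. The Python
sketch below does this; the function name `addends_per_bit` and its parameter
names are assumptions made for illustration.

```python
def addends_per_bit(x_bits=64, y_bits=16, z_bits=16, cs_bits=63):
    """Count the addends at each bit position for the words of figure 2."""
    width = x_bits + y_bits - 1            # bit positions 0 .. 78
    counts = []
    for b in range(width):
        # Product term Pj = X * Y(j) occupies bit positions j .. j + x_bits - 1.
        products = sum(1 for j in range(y_bits) if j <= b < j + x_bits)
        z_term = 1 if b < z_bits else 0    # Z(15:0)
        cc_cs = 2 if b < cs_bits else 0    # Cc(62:0) and Cs(62:0)
        counts.append(products + z_term + cc_cs)
    return counts
```

For the default 64 by 16-bit case the list rises from 4 at bit 0 to 19 at bit
15, stays at 18 from bit 16 to bit 62, and falls from 16 at bit 63 to 1 at bit
78, as stated above.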

A Wallace tree is a conventional way of configuring an array of carry-
save adders for the performance of the addition operations for a large number
of addends, using an optimised number of levels. Figure 4 shows a fragment
of such a Wallace tree 40.

Each adder adds three inputs and has two outputs: a carry and a sum. 
A Wallace tree assumes that the number of addends per bit position is 
constant, and figure 4 shows the configuration of tree 40 that would be 
appropriate for implementing the required additions indicated by figure 3. In 
this case, the tree is configured for 19 addends per bit position, since this 
maximum occurs for bit position 15. 

At the first level, indicated as 'layer 1' on the drawing, there are six
carry-save adders 41 for each bit position, eg. bit position j as shown. These
six carry-save adders provide a total of eighteen inputs 42, six sum outputs 43
and six carry outputs 44. Furthermore, there is one additional input 45, which
is added into level 3 ('layer 3'). This gives the required total of nineteen inputs.

The six sum outputs 43 are added in the next level 2 by carry-save adders
46. The six carry outputs 44 are added in the next level 2 of the tree but in the
carry-save adders 56 of the next bit position to the left, indicated as j+1. The
carry-save adders 61 of the first level for the preceding bit position j-1 also
provide six carry outputs 64 which are provided to the adders 46 of level 2 for
bit position j. The conventional Wallace tree assumes that the number of carry
inputs (eg. 43, 44) equals the number of carry outputs, which is always the
case when the number of inputs for each bit position at level 1 is the same.
Such a Wallace tree gives the minimum number of levels for a given
number of addends according to the table below:

Number of addends    Number of levels
1, 2, 3              1
4                    2
5, 6                 3
7, 8, 9              4
10-13                5
14-19                6
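
The entries of this table follow from the fact that each level of three-input
carry-save adders turns every group of three bits into two. A minimal Python
sketch of that counting argument is given below; the function name
`wallace_levels` is an assumption, and for one or two addends (where no
reduction is needed) it returns 0, whereas the table groups those cases with
the single-level entry.

```python
def wallace_levels(addends):
    """Number of 3-input carry-save levels needed to reach two bits per column."""
    levels = 0
    n = addends
    while n > 2:
        # Every full group of three becomes a sum and a carry;
        # the remaining one or two bits pass straight through.
        n = 2 * (n // 3) + n % 3
        levels += 1
    return levels
```

Calling `wallace_levels(19)` gives 6, in line with the last row of the table,
and a twentieth addend would require a seventh level.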



It has been recognised that particularly - though not exclusively - for
the computations required for the expression R = X * Y + Z mod N discussed
above, the number of adders required for a given number of addends can be
reduced, particularly when the number of addends is variable through the
calculation.

Figure 5 illustrates a section or fragment of the basic structure of an
exemplary 'adaptive tree' or network 70 according to the present invention, for
each of bit number positions j+1, j, and j-1, each bit position corresponding to
a column in the tree. In the fragment of figure 5, the number of addends is 18
in each bit position (column). This basic structure is used for all bit positions,
but the number of carry-save adders at each level and in each bit position is
determined independently according to the number of addends required at that
respective bit position. Figure 8 shows a further section of the adaptive tree
70, specifically for bit positions 0 through to 8, respectively requiring 4 through
to 12 addends (see figure 3). The adaptive tree therefore comprises a tree
structure of adders which is structured to minimise or reduce the number of
adders required where there are variable numbers of input bits for the
respective input bit positions.

The determination of the structure of the adaptive tree or network is
established according to the following rules.

At the first level, the number of carry-save adders 71 in a given bit
position is set to the number of input addends divided by three and rounded
down to the nearest whole number. For example, for sixteen inputs, five
adders are required. For eighteen inputs as illustrated in figure 5, position j,
six adders 71 are required.

At each of the subsequent levels, the number of adders for the given bit
position is determined according to the expression:
(number of adders for bit position j at level n) =
{(number of sum outputs from level n-1 in bit position j) +
(number of unconnected inputs of level n-1 in bit position j) +
(number of carries of level n-1 in bit position j-1)}
divided by 3 and rounded down to the nearest integer.

Thus, referring specifically to figure 5, at an intermediate portion of the
tree 70 requiring eighteen inputs for bit position j, at level 1, the number of
adders 71 is six. At level 2, according to the formulation above, the number of
adders 72 is INT{(6 + 0 + 6) / 3} = 4. At level 3, the number of adders 73 is
INT{(4 + 0 + 4) / 3} = 2. At level 4, the number of adders 74 is INT{(2 + 2 + 2) /
3} = 2. At level 5, the number of adders 75 is INT{(2 + 0 + 2) / 3} = 1. Finally,
for level 6, the number of adders 76 is INT{(1 + 1 + 1) / 3} = 1. It will be noted
that for each of the bit positions j+1, j and j-1, for eighteen addends, there is a
saving of one carry-save adder at level 3 in each bit position.

Referring specifically to figure 8, at one end of the tree 70 further
savings are made, because the number of carries from the right is smaller -
because of the increasing number of input bits - than in the Wallace tree case.
For example, at bit position 7, eleven addends are present. A conventional
Wallace tree would suggest five levels. In fact, in this position, four levels,
respectively having three, two, two and one adder(s) are required.
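
The rules for sizing the adaptive tree lend themselves to a small simulation.
The Python sketch below applies them column by column; it is an illustrative
model only, and the function name `adaptive_tree` and the `periodic` option
(which feeds the carries of the last column into column 0, to mimic an interior
fragment whose neighbours all have the same number of addends) are assumptions
introduced here.

```python
def adaptive_tree(addends, periodic=False):
    """Return, for each column, the number of 3-input adders at each level.

    addends[j] is the number of input bits in column j (column 0 is the
    least significant bit position).
    """
    inputs = list(addends)
    per_column = [[] for _ in inputs]
    while any(n >= 3 for n in inputs):
        adders = [n // 3 for n in inputs]       # divided by 3, rounded down
        unconnected = [n % 3 for n in inputs]   # bits passed on unchanged
        for j, a in enumerate(adders):
            per_column[j].append(a)
        next_inputs = []
        for j in range(len(inputs)):
            if j > 0:
                carries = adders[j - 1]         # carries from bit position j-1
            elif periodic:
                carries = adders[-1]
            else:
                carries = 0
            # sum outputs + unconnected inputs + carries from the right
            next_inputs.append(adders[j] + unconnected[j] + carries)
        inputs = next_inputs
    return per_column
```

Under these assumptions, `adaptive_tree([18, 18, 18], periodic=True)[1]`
evaluates to `[6, 4, 2, 2, 1, 1]`, matching the figure 5 walk-through, and
`adaptive_tree(list(range(4, 12)))[7]` evaluates to `[3, 2, 2, 1]`, matching
the four levels quoted for bit position 7.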
In some cases the number of levels can be reduced still
further by the addition of a two-input carry-save adder at strategic positions
within the network. First, the design is implemented using only three-input
carry-save adders to form a network 70 according to the strategy defined
above. To identify the strategic positions in which to insert a two-input
carry-save adder, it is necessary to identify, in each level ('Ln') and bit
position ('Bj'), locations where the number of inputs to that bit position Bj and
level Ln exceeds a minimum number, eg. two. Where it does, a two-input
carry-save adder is inserted at a level (eg. Ln-1 or Ln-2, etc) above the
location, at which level there are two unconnected addends. This effectively
moves one input to the next higher order bit position Bj+1. This in turn may
result in a consequential exceeding of the allowed number of outputs for the
next bit position and therefore the procedure must be repeated a number of
times until the number of inputs for all bit positions does not exceed the
allowed number.

For example, referring specifically to figure 9, there may be a
decreasing number of inputs for the higher order bits resulting in a higher than
necessary number of layers. The maximum number of inputs per bit position
is three, so one level of adders should be sufficient. In figure 9, we have three
inputs for the adder 100 of bit position 58 and a carry output 101 from an adder
in bit position 57 (not shown). We have two inputs for each of the adders 102,
103 of bit positions 59 and 60 respectively, and one input for bit position 61.
For bit position 59, we have three (instead of the desired two) outputs from
level 1: one carry output from bit position 58 and two unconnected word inputs.
Three levels (labelled layer 1, layer 2 and layer 3) are required because of the
carry 101 from bit position 58 to 59 and in the same way from bit 59 to 60. This
gives two additional layers.

With reference to figure 10, we can mitigate this situation by using extra
two-input, two-output adders 110, 111 (labelled as 'CSA2', in contrast to the
three-input, two-output carry-save adders, 'CSA3'). Such adders do not
reduce the number of inputs in total, but they do reduce the number for that
bit position by one. The CSA2 adder 110 increases the number of inputs for the
next higher bit position 60 from two to three, so the problem is moved to bit
position 60 instead of bit position 59. However, CSA2 adder 111 is also inserted,
which reduces the number of inputs to level 1, bit position 60 from three to two.
The consequent increase in the number of inputs at bit position 61 from one to
two does not matter.

In principle, it has been recognised that strategic handling of pairs of
addends with two-input adders at higher levels in a particular bit position can
result in a further decrease in the number of levels. In other words, locally
increasing the summation capacity with two-input adders in one or more
adjacent higher order positions can consequently reduce summation capacity
required at lower levels, ultimately reducing the number of levels, without
requiring an additional three-input adder.

This solution increases the number of addends for a left neighbour
which might, as a result, get too many inputs. If so, a number of two-input
adders may need to be inserted in a level until there is a bit position with a
sufficiently low number of inputs as shown by bit position 61.

In a general sense, a procedure for inserting additional two-input
carry-save adders may be defined as the following steps. Firstly, for a given
number of levels, find a first location in the network having a bit position Bj and
level Ln where the number of outputs at that first location is greater than two
(eg. three, instead of two) and where at some higher level there are two
unconnected addends. Secondly, in respect of that first location, insert a
two-input carry-save adder at a second location having the same bit position Bj
but having a level (eg. Ln-1, Ln-2, etc) above the first location, at which location
there are two unconnected addends.

The procedure may need to be repeated a number of times until the 
number of inputs for all bit positions does not exceed the allowed number. 
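
The effect of the two-input adders can be illustrated with a simple counting
model of the four bit positions 58 to 61 discussed with reference to figures 9
and 10. This Python sketch reflects one reading of those figures and is not
taken from them directly: the function name `levels_needed`, the treatment of
carry 101 as a single bit arriving in the lowest column after level 1, and the
restriction of CSA2 adders to level 1 are all assumptions.

```python
def levels_needed(inputs, carry_in=0, csa2_columns=()):
    """Count adder levels until every column holds at most two bits.

    inputs[j]    : number of input bits in column j (least significant first)
    carry_in     : carries arriving in column 0 from a lower column after level 1
    csa2_columns : columns given a two-input adder (CSA2) at level 1
    """
    bits = list(inputs)
    levels = 0
    while any(b >= 3 for b in bits):
        levels += 1
        nxt = [0] * len(bits)
        for j, b in enumerate(bits):
            if levels == 1 and j in csa2_columns and b == 2:
                sums, carries = 1, 1                    # CSA2: two bits in, sum and carry out
            else:
                sums, carries = b // 3 + b % 3, b // 3  # CSA3s plus unconnected bits
            nxt[j] += sums
            if j + 1 < len(bits):
                nxt[j + 1] += carries                   # carry moves one column to the left
        if levels == 1:
            nxt[0] += carry_in
        bits = nxt
    return levels
```

With columns `[3, 2, 2, 1]` and `carry_in=1`, the model needs three levels when
only three-input adders are used and a single level when two-input adders are
placed in the second and third columns (`csa2_columns=(1, 2)`), which agrees
with the reduction described above.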

With reference to figure 6, the adaptive tree may be used in an
unpipelined adder configuration 80. In this arrangement, the adaptive tree has
a maximum of six levels 81, 82 ... 86 for summing all the addends of figure 2.
The adder sums all sixteen products P0 ... P15, Z and the fed back carry term
Cc(62:0) and sum term Cs(62:0) using an adaptive tree of six levels. The
output 87 of the tree is registered, such that the higher order part of final carry
term Cc'(78:16) and the higher order part of final sum term Cs'(78:16) output
are fed back on feedback line 91 and shifted to bit positions (62:0) as input for
the next calculation. The lower order part of carry term Cc'(15:0) and sum
term Cs'(15:0) are summed by an additional full adder 88 and saved to register
89, which is the term r in the formula B.(c'+s') + r = x.y + (c'+s') + z.

This later addition of the lower order parts of carry and sum terms
Cc'(15:0) and Cs'(15:0) itself generates a further single bit carry term,
identified in figure 6 as c"16. This single bit carry term is fed back for inclusion
in the next summation by full adder 88, as indicated by the feedback line 90.

Thus, in a general sense, the additional full adder 88 and register 89
exemplify an output stage which adds the sum and carry terms to provide a first
word output of a final result, and retains a carry bit c"16 to be used as input
for a subsequent stage of the calculation in which the main adder array
generates a further, higher order sum and carry term for addition by the output
stage.
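
The reason the retained carry bit can simply be added in by the output stage
during the next calculation is that it has the same weight as the least
significant bit of the next output word. The Python model below demonstrates
this; it is a behavioural sketch only. The function name
`mac_with_output_stage` is hypothetical, and because the text does not fix how
the tree divides its total between Cc' and Cs', an arbitrary pseudo-random
split is assumed.

```python
import random

def mac_with_output_stage(x, y_words, z_words, m=16, seed=1):
    """Compute x*Y + Z word by word with a carry-save state and an output stage."""
    rng = random.Random(seed)
    mask = (1 << m) - 1
    cc = cs = 0        # fed-back higher order parts of Cc' and Cs'
    carry_bit = 0      # single bit carry retained by the output stage
    words = []
    pending = list(zip(y_words, z_words))
    i = 0
    # Keep stepping with y = 0, z = 0 until the carry state has drained.
    while i < len(pending) or cc or cs or carry_bit:
        y, z = pending[i] if i < len(pending) else (0, 0)
        total = x * y + z + cc + cs             # what the adder tree sums
        cs_out = rng.randrange(total + 1)       # assumed: any split into two terms
        cc_out = total - cs_out
        low = (cc_out & mask) + (cs_out & mask) + carry_bit
        words.append(low & mask)                # output word r
        carry_bit = low >> m                    # retained for the next step
        cc, cs = cc_out >> m, cs_out >> m       # higher order parts fed back
        i += 1
    return sum(w << (m * j) for j, w in enumerate(words))
```

Whatever split is chosen, the assembled output words equal x*Y + Z, so the
tree never has to wait for the carry out of the output adder.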

Alternatively, the carry term c"16 could be fed back to level 1, bit 0 of the
adaptive tree as shown at 81, since it has the same weight as Cc'(16) and
Cs'(16). A disadvantage of this technique is that the adaptive tree must wait
for the c"16 output of the full adder 88 before commencing a subsequent
calculation. Therefore it is preferable to use the full adder 88 to add the c"16
term.

The carry bit c"16 is cleared, like Cc' and Cs', at the beginning of each
new multiplication.

In a further arrangement, as shown in figure 7, the adaptive tree 180
can be given a pipelined configuration, having a number of levels 181 ... 187. In
this case, it is generally necessary to feed back the higher order part of the
carry Cc'(78:16) and sum Cs'(78:16) to a preceding level (ie. an 'intermediate'
level 185) instead of the first level 181. Thus, in the specific arrangement
shown in figure 7, rather than wait for the higher order part of final carry term
Cc'(78:16) and the higher order part of final sum term Cs'(78:16) output from
the last level 187 to be fed back to level 1 prior to commencement of the next
calculation, these terms can be added in at level 5, as shown. Although this
arrangement increases the number of levels by one, to 7, the delay is reduced
from a six level delay as in the arrangement of figure 6 to a four level delay as
in the arrangement of figure 7.

In this configuration, in a general sense, the feedback line 191 couples
the more significant bit output of the adder circuits to a corresponding number
of less significant bit inputs of an intermediate level of adder circuits. It may be
necessary to provide an intermediate level register 191 for temporarily holding
the summation results from the first four levels 181 ... 184.

This increases the speed of operation by a factor of 1.5 (a six level delay
reduced to a four level delay), at the cost of a significant increase in hardware.
In the example given, an additional 275 registers are required to service the
additional level.

Another advantage of the adaptive tree occurs for pipelined versions. In
figure 7, most of the adders of the lower order bit numbers, where at most four
levels are required, are placed in the first four layers, thereby reducing the
number of registers. By contrast, the Wallace tree requires these adders to
be placed in the lower layers. This therefore requires far more level 4
registers, since the Wallace tree does not reduce the number of inputs in the
upper levels for the lower bit numbers.

The arrangement of figure 7 may also include the output stage
188 ... 190 as described in connection with the output stage 88 ... 90 of the
arrangement of figure 6.

Other embodiments are intentionally within the scope of the 
accompanying claims. 




